Computational Biology and Chemistry — Latest Matching Preprints

1

Glycine molecule radical: Predicted properties and dipeptide formation

Synak, J.; Blazewicz, J.

2026-07-10 bioinformatics 10.64898/2026.07.07.736934 medRxiv

Top 0.1%

4.1%

Numerous advances in quantum and computational chemistry over the last decades, as well as the development of computer science, have allowed the use of more precise and complex models, which can now be applied to much bigger systems than in the past. The authors used Gaussian, coupled with theoretical methods, to predict a new route of peptide bond formation that could have taken place under prebiotic conditions. To better tackle this difficult task, the properties of the substrates (glycine-derived radicals) were extensively analysed using Gaussian, taking resonance and hybridisation into account, to better understand the stereochemistry and the nature of the processes taking place. The result is a series of reactions that, without any sophisticated catalysts and with relatively low energy thresholds (<20 kcal/mol), can lead to the formation of dipeptides (and, further, oligopeptides). The authors also hope that the other predicted properties of the investigated molecules will be of use to researchers who would like to utilise them in their experiments. Author summary: Our goal was to investigate how the first peptide bonds could have formed under prebiotic conditions. This is an extremely important step in research into the beginning of life on Earth. We found a very promising series of reactions, which uses atomic hydrogen as its only catalyst, and confirmed our expectations with theoretical calculations using Gaussian. Two radicals derived from glycine play major roles in the process, so we investigated their properties with Gaussian and verified that the results agree with our own theoretical considerations. This involved checking for possible geometric isomers and conformers and creating models that could explain their properties.
We are well aware that such calculations have limitations and that no model is 100% accurate, so our results should be further confirmed by empirical data in the future. However, we still tried to be as thorough as possible in how we approached the subject.

2

The Gompertz curve for estimating growth rates of Protein Data Bank and protein folds

Sato, K.; Tomii, K.

2026-06-26 bioinformatics 10.64898/2026.06.24.732253 medRxiv

Top 0.7%

1.1%

The Protein Data Bank (PDB) is an ever-growing, open-access repository of structural data of biological molecules. This international database has been instrumental in the development of artificial intelligence and deep learning models for protein structure prediction and design. The PDB growth is a crucially important factor influencing further development of these models. Therefore, after analyzing the growth trend in PDB depositions since the archive's launch, we found that it is well fitted by the Gompertz function, a growth curve used across various disciplines. Furthermore, we observed that the function captures the "discovery of novel folds", i.e., the cumulative number of distinct folds among protein domains that constitute most of the PDB. Consequently, based on the fitting results, we estimated the likely numbers of PDB entries and protein folds. These findings provide insights into deceleration of growth in recent years and enable us to assess anticipated trends.
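
For readers unfamiliar with the Gompertz function, one common parameterisation is N(t) = K * exp(-b * exp(-c * t)), where K is the asymptote. The sketch below is not the authors' fitting procedure: it assumes K is already known and recovers b and c by ordinary least squares on the linearised form ln(ln(K/N)) = ln(b) - c*t. All names and numbers are hypothetical and the data are synthetic.

```python
import math

def gompertz(t, K, b, c):
    """Gompertz growth curve: K * exp(-b * exp(-c * t))."""
    return K * math.exp(-b * math.exp(-c * t))

def fit_gompertz_known_K(ts, ns, K):
    """Estimate (b, c) by least squares on ln(ln(K/N)) = ln(b) - c*t.

    Assumption: every N is strictly between 0 and K.
    """
    ys = [math.log(math.log(K / n)) for n in ns]
    m = len(ts)
    t_mean = sum(ts) / m
    y_mean = sum(ys) / m
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return math.exp(intercept), -slope

# Synthetic, made-up parameters (not PDB statistics).
K_TRUE, B_TRUE, C_TRUE = 500000.0, 8.0, 0.06
years = list(range(0, 50, 5))
counts = [gompertz(t, K_TRUE, B_TRUE, C_TRUE) for t in years]
b_hat, c_hat = fit_gompertz_known_K(years, counts, K_TRUE)
```

In practice the asymptote is itself unknown and would have to be estimated jointly, for example with a nonlinear least-squares routine; the linearised fit only illustrates the shape of the model.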

3

A Universal Immune Index (II): A Composite Quantitative Assessment Method and Calculation Tool for Immune Function Based on Multidimensional Routine Laboratory Parameters

Zhang, Y.; Li, K.

2026-06-25 allergy and immunology 10.64898/2026.06.22.26356269 medRxiv

Top 0.7%

1.1%

Background: Quantitative assessment of immune function is essential for clinical and health decisions in oncology, post-surgical management, and autoimmune diseases. Existing methods are either too simplistic (single indicators) or too complex and costly for routine use. A standardized, easy-to-operate tool based on routine laboratory parameters is needed for both clinical and health checkup settings. Methods: We propose the Immune Index (II), integrating 9 routine laboratory parameters across three dimensions: humoral immunity (IgG, complement C3, C4), cellular immunity (CD4+ T cells, CD8+ T cells, CD4+/CD8+ ratio), and inflammatory response (CRP, IL-6, systemic immune-inflammation index [SII]). Indicators were normalized using min-max normalization to a 0-100 scale and aggregated with fixed weights (humoral 30%, cellular 40%, inflammatory 30%). The II score ranges from 0 to 100, with a healthy reference range of 50-80. Results: A four-tier grading system was established: >=80 (immune overactivation), 50-80 (immune homeostasis), 35-50 (mild immune suppression), <35 (severe immune deficiency). Validation using 209 cases from published literature showed an AUC of 0.924 (95% CI: 0.87-0.97) for distinguishing normal from abnormal immune status, with an optimal cutoff of 47.8 (sensitivity 84.8%, specificity 85.9%). II scores were 56.7+/-8.6 (healthy), 43.5+/-8.0 (immunodeficient), and 33.6+/-6.5 (autoimmune), with P<0.001 between all groups. The calculation requires only two steps and can be implemented in Excel or LIS. II can serve as an immune dimension supplement for personal health checkups. Conclusion: The Immune Index provides a simple, standardized, and low-cost tool for quantitative immune function assessment. The fixed-weight design ensures cross-institutional comparability, making it suitable for outpatient clinics, health checkup centers, and primary care settings.
Keywords: Immune index; immune function; quantitative assessment; routine laboratory parameters; composite score; min-max normalization
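
The two-step calculation described in the abstract (min-max normalization, then fixed-weight aggregation) can be sketched as follows. This is an illustration only, not the authors' tool: the abstract does not give the per-parameter reference bounds or say how indicators are combined within a dimension, so the sketch takes the three dimension scores as inputs. Clamping out-of-range values and assigning the boundary score of 50 to the homeostasis tier are assumptions made here.

```python
# Dimension weights as stated in the abstract.
WEIGHTS = {"humoral": 0.30, "cellular": 0.40, "inflammatory": 0.30}

def min_max_scale(value, low, high):
    """Min-max normalize to a 0-100 scale; clamping is an assumption."""
    if high <= low:
        raise ValueError("high must exceed low")
    scaled = (value - low) / (high - low) * 100.0
    return max(0.0, min(100.0, scaled))

def immune_index(dimension_scores):
    """Weighted sum of the three 0-100 dimension scores."""
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)

def grade(score):
    """Four-tier grading from the abstract (boundary handling assumed)."""
    if score >= 80:
        return "immune overactivation"
    if score >= 50:
        return "immune homeostasis"
    if score >= 35:
        return "mild immune suppression"
    return "severe immune deficiency"

# Hypothetical dimension scores for one person.
example = {"humoral": 60.0, "cellular": 55.0, "inflammatory": 50.0}
example_score = immune_index(example)
example_grade = grade(example_score)
```

Note that the grading tiers (cut at 50) and the reported optimal diagnostic cutoff (47.8) are separate thresholds in the abstract; the sketch implements only the tiers.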

4

Characterization of ATM gene expression and evaluation of Reactive Oxygen Species in Silibinin-treated SKBR3 cells

Nademi, N. S.; Motamed, N.

2026-07-09 cancer biology 10.64898/2026.07.02.736131 medRxiv

Top 0.8%

1.0%

Background: Reactive oxygen species (ROS) are small, unstable, and highly reactive species capable of oxidizing DNA. Oxidation of DNA's purine and pyrimidine bases can lead to single- or double-strand breaks in this macromolecule. In this situation, ATM, a serine-threonine kinase, phosphorylates several target proteins, which halts the cell cycle and initiates DNA damage repair. Natural polyphenols have previously been shown to have cancer-inhibiting properties, with high efficacy and few side effects. Silibinin, the main medicinal ingredient of milk thistle (Silybum marianum), is a polyphenolic flavonolignan that has been widely considered an antioxidant and anticancer agent. The purpose of the present study was to investigate ATM gene expression and measure ROS in the SKBR3 cell line treated with Silibinin. Materials and Methods: The SKBR3 cell line was first cultured in RPMI 1640 medium, and an MTT assay was carried out to evaluate Silibinin cytotoxicity. Flow cytometry was used for cell cycle analysis, apoptosis detection, and ROS detection, while real-time PCR was used to evaluate ATM gene expression in Silibinin-treated and untreated SKBR3 cells. Results: Silibinin at 150 µM had the most significant cytotoxic and apoptosis-inducing effect after a 48 h treatment period. Flow cytometry data showed that Silibinin induced a considerable amount of apoptosis, caused cell cycle arrest at the G1/S phase, and induced ROS production. Real-time PCR results revealed that Silibinin increased ATM expression in the SKBR3 cell line. Conclusion: Silibinin increases ATM gene expression by inducing ROS production, which initiates cell cycle arrest and apoptosis in SKBR3 cells.

5

AptViralDB: A Repository of Experimentally Validated Antiviral Aptamers

Bajiya, N.; Singh, S.; Gahlot, P. S.; Raghava, G. P. S.

2026-07-11 bioinformatics 10.64898/2026.07.08.737144 medRxiv

Top 0.8%

1.0%

In an era of increasing drug resistance, exploring alternative molecules is crucial for the efficient management and treatment of viral diseases. Nucleic acid aptamers have emerged as highly promising candidates due to their exceptional target specificity, low immunogenicity, and versatile mechanisms for viral blocking. This manuscript describes AptViralDB, a manually curated database providing comprehensive information on experimentally validated antiviral aptamers. It contains 1,768 entries of antiviral aptamers against 40 viral species and 104 molecular targets, compiled from literature and existing databases. Each entry provides detailed annotations, including sequence, aptamer type, target, chemical modifications, binding affinity, antiviral activity, stability, and cytotoxicity. We also provide predicted secondary structures and their corresponding minimum free energy (MFE) values. Additionally, a knowledge graph created using ArcadeDB/openCypher enables users to seamlessly explore connections among aptamers, viruses, molecular targets, and biological activities. Finally, the platform offers advanced search and browsing tools, BLAST-based sequence similarity searches, GC-content analysis, downloadable datasets, and REST API access to support computational applications (https://webs.iiitd.edu.in/raghava/aptviraldb/).
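
GC-content analysis of the kind mentioned in the abstract reduces to a simple ratio. The following is a minimal sketch, not AptViralDB's implementation; the function name and the decision to ignore characters other than A, C, G, T, and U are assumptions made here.

```python
def gc_content(sequence):
    """Fraction of G and C among the A/C/G/T/U characters of a sequence.

    Other symbols (gaps, modification codes, whitespace) are ignored,
    which is an assumption of this sketch.
    """
    counted = [base for base in sequence.upper() if base in "ACGTU"]
    if not counted:
        raise ValueError("no nucleotide characters found")
    gc = sum(1 for base in counted if base in "GC")
    return gc / len(counted)

# Hypothetical aptamer-like sequences (made up for illustration).
dna_fraction = gc_content("GGCCAATT")
rna_fraction = gc_content("gcau")
```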

6

Tracing the regulatory atlas of non-coding RNA in human labour

Magateshvaren Saras, M. A.; Ahmad, S.; Smith, R.; Mitra, M. K.; Tyagi, S.

2026-07-07 bioinformatics 10.64898/2026.07.06.736857 medRxiv

Top 1%

0.9%

The early onset of labour increases mortality and developmental risks for a human newborn. Key genes in human labour have been investigated using multiple modalities, but their regulation by non-coding RNA (e.g. lncRNA and miRNA) remains incompletely understood. This study explores the three-way relationship between labour-associated transcription factors (TFs), miRNA, and lncRNA suggested by the competing endogenous RNA (ceRNA) hypothesis, to understand the underlying regulatory framework. Experimentally validated miRNA-lncRNA interactions are modelled using five distinct machine learning (ML) architectures to predict 20469 labour-linked miRNA-lncRNA interactions. Known mRNA-ncRNA interactions from databases were included to construct a tripartite network, and a subset of 9989 labour-linked network motifs containing TFs was isolated and analysed. Gene enrichment of nodes in the TF-lncRNA-miRNA network, as well as validation against public myometrial datasets, indicates high significance in contractile pathways, including immune signalling. Experimentally unconfirmed tripartite network motifs have been found, and we elaborate on their potential regulation in labour using 8 TF-lncRNA-miRNA network motifs. A unified ncRNA-TF regulatory atlas in labour has been synthesized, and a complete summary of the tripartite network motifs can be accessed and visualised using a user-friendly public database.

7

AptCancerDB: A Curated Knowledgebase and Translational Discovery Platform for Anticancer Aptamers

Bajiya, N.; Singh, S.; Raghava, G. P. S.

2026-07-09 cancer biology 10.64898/2026.07.02.735999 medRxiv

Top 1%

0.9%

Aptamers are emerging as important molecular recognition ligands in oncology, playing significant roles in cancer diagnostics, targeted therapies, drug delivery systems, and molecular imaging. Numerous aptamers have advanced to clinical trials, indicating their potential for real-world applications; however, existing databases fail to capture that. To bridge this critical gap, we developed AptCancerDB (https://webs.iiitd.edu.in/raghava/aptcancerdb/), a comprehensive, manually curated database of experimentally verified anticancer aptamers. The current release contains 1,941 entries collected from studies published between 2000 and 2025, covering 29 cancer types, approximately 200 cancer cell lines, and direct links to 22 clinical trials. Each entry is annotated with sequence information, target details, cancer type, cell line, SELEX methodology, affinity determination data, chemical modifications, and biological activities. The dataset is dominated by 82.7% ssDNA, reflecting its superior stability and ease of synthesis, while only 16.6% is ssRNA and appears primarily in studies targeting complex intracellular or protein-protein interactions. To facilitate structural analysis, predicted secondary structures, dot-bracket notations, specific structural elements, and minimum free energy values were also included. AptCancerDB integrates a MySQL backend with an ArcadeDB/OpenCypher-based Knowledge Graph, enabling exploration of relationships among aptamers, targets, cancer types, cell lines, and functional applications. The platform provides advanced search and browsing facilities, BLASTn-based similarity searching, and GC Calculator. Built on a modern, responsive frontend (React/TypeScript/Tailwind CSS), the platform includes a REST API for data retrieval. 
By integrating fragmented experimental data into a unified cancer-focused platform, AptCancerDB serves as a valuable resource for comparative analysis, aptamer discovery, and the development of next-generation aptamer-based diagnostics and therapeutics.
Highlights: Curated knowledge base of experimentally validated anticancer aptamers. AptCancerDB contains therapeutic, tumor-homing, and cell-penetrating aptamers. Summarizes clinical progress and translational trends in anticancer aptamer research. Supports rational aptamer design using molecular, functional, and clinical annotations. Disease-focused resource for cancer diagnosis, therapy, and drug delivery.
Teaser: AptCancerDB maintains experimentally validated anticancer aptamers relevant to diagnosis, drug delivery, and therapy.
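
Dot-bracket notation, which the abstract lists among the structural annotations, encodes each base pair as a matching "(" and ")" and each unpaired base as ".". The sketch below shows how such a string can be turned into a list of paired positions. It is illustrative only and not part of AptCancerDB; it handles plain parentheses and dots and does not support extended bracket types for pseudoknots.

```python
def dot_bracket_pairs(structure):
    """Return sorted (i, j) index pairs (0-based) for a dot-bracket string."""
    stack = []
    pairs = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unsupported character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)

# A made-up hairpin: two stacked base pairs closing a two-base loop.
hairpin_pairs = dot_bracket_pairs("((..))")
```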

8

Rewiring of EGFR oncogenic program by opposing actions of membrane versus soluble CD109 in HNSCC

Durgempudi, V.; Kungyal, T.; Hassan, A.; Nelea, V.; Finnson, K.; Reinhardt, D.; Sadeghi, N.; Philip, A.

2026-06-23 cancer biology 10.64898/2026.06.20.733552 medRxiv

Top 1%

0.6%

Epidermal growth factor receptor (EGFR) expression is often dysregulated in head and neck squamous cell carcinoma (HNSCC), driving cancer cell proliferation, invasion, and metastasis through diverse pathways and thereby contributing to aggressive chemo- and radiotherapy resistance. CD109, a GPI-anchored protein, is upregulated in multiple cancers, including HNSCC. While membrane-anchored CD109 (mCD109) is pro-tumorigenic in SCC via EGFR/STAT3 activation, the role of protease-cleaved soluble CD109 (sCD109) is poorly understood. Our findings demonstrate that sCD109 antagonizes EGFR signaling by directly binding to the EGFR extracellular domain and preventing mCD109-EGFR stabilizing interactions on the cell surface. This is followed by inhibition of EGFR phosphorylation at Y1068 and of downstream signaling cascades (AKT, MAPK, and STAT3), which consequently suppresses cancer cell migration, invasion, 3D tumor spheroid formation, and angiogenic tube formation. In addition, we found that sCD109 regulates EGFR fate by inhibiting nuclear localization of phosphorylated EGFR and promoting EGFR degradation. sCD109 also significantly reduces EGF-induced expression of cancer stem cell markers (CD44 and CD133) and embryonic stem cell markers (Nanog and Sox2), suggesting a suppressive role in cancer stemness. Taken together, these results underscore the opposing roles of mCD109 and sCD109, with sCD109 acting as an antagonist by inhibiting mCD109/EGFR-driven oncogenic signaling and phenotypes. Our current findings reveal a complex interplay among mCD109, sCD109, and EGFR, identify a mechanism for targeting EGFR degradation in HNSCC, and lay the groundwork for future research on sCD109's modulatory role in preclinical models of HNSCC.

9

Correlation analysis of changes in the expression of C1qtnf superfamily genes in the hypothalamus, thymus, and lungs against the background of chronic social stress during the development of Lewis lung adenocarcinoma in mice

Kudryavtseva, N. N.; Smagin, D. A.; Kovalenko, I. L.; Popova, N. A.; Pavlova, M. B.

2026-07-09 cancer biology 10.64898/2026.07.02.735448 medRxiv

Top 1%

0.6%

It has been previously shown that chronic social defeat stress caused by paired agonistic interactions between male mice is accompanied by the development of depression-like state and immune deficiency. The aim of this study was to investigate changes in the expression of C1qtnf superfamily genes (encoding the complement component related with tumor necrosis factor) in the hypothalamus, thymus and lungs against the background of the Lewis lung adenocarcinoma growth. In the experiments, on the 5th day of social stress, male mice were injected with tumor cells into the tail vein. Chronic social stress continued for the next two weeks. The transcriptomes of the hypothalamus, thymus and lungs of mice were sequenced at the Genoanalytica Collective Center (http://genoanalytica.ru/, Moscow). Changes in the expression of the C1qtnf genes in the tissues of stressed mice were studied compared with the control and mice that were additionally injected with tumor cells. Overall, significant correlations were found between expression of most genes in each tissue of the experimental groups. In the hypothalamus of stressed animals, when tumor cells were introduced, an increase in the expression of the genes C1qtnf1, C1qtnf2, C1qtnf3, C1qtnf6 and C1qtnf7 was observed compared to controls. In the thymus of these animals, tumor cell injection increased expression of the C1qtnf1, C1qtnf5, and C1qtnf6 genes. In the lung of tumor-injected stressed mice, expression of the C1qtnf1, C1qtnf2, C1qtnf7, and C1qtnf9 genes was decreased relative to controls and non-tumor-injected depressed mice, reaching near-zero levels in some mice. Analysis of C1qtnf superfamily gene expression in the all tissues revealed negative correlations between the expression of the C1qtnf1, C1qtnf2, and C1qtnf7 genes in the hypothalamus and lungs indicating synchronization of processes against the background of social stress and Levis lung adenocarcinoma.

10

Targeting Dengue Virus NS3 Helicase: Biochemical and Computational Evaluation of Catechins from Camellia sinensis as Potential Therapeutic Leads

Wojciechowski, M. K.; Goyzueta-Mamani, L. D.; Chavez-Fumagalli, M. A.; D'Antonio, E. L.

2026-06-23 biochemistry 10.64898/2026.06.22.733882 medRxiv

Top 1%

0.6%

Dengue Virus Serotype 2 is a human pathogenic flavivirus that encodes a non-structural protein 3 (DEN2-NS3) containing a helicase domain essential for viral replication. DEN2-NS3 utilizes energy derived from NTP hydrolysis to unwind dsRNA and dsDNA. A galloylated catechin, (-)-epigallocatechin gallate (EGCG), was previously reported to be highly potent against the Zika Virus NS3 helicase, with an IC50 value observed at 295.7 nM. This prompted an investigation to determine if three catechins, namely, (-)-epigallocatechin (EGC), (-)-epicatechin gallate (ECG), and EGCG, would act as potent inhibitors of DEN2-NS3. Enzyme-inhibition assays revealed that the helicase catalytic domain, DEN2-NS3(S171-K618), is strongly inhibited by these galloylated catechins. We observed Ki values of 400 ± 86.6 nM for EGCG (mixed-mode inhibition with respect to ATP) and 550 ± 250 nM for ECG (uncompetitive inhibition with respect to ATP). Furthermore, using a computational workflow starting with SiteMap, we provide evidence that a highly druggable pocket exists within the RNA-binding cavity, involving residues ASP290, ARG387, ASP409, MET429, HIS487, ASP541, ARG599, and ASP603. These catechins were each analyzed through 200-ns molecular dynamics (MD) simulations to evaluate the binding stability within the target DEN2-NS3 binding pocket. Computational results revealed that EGCG and ECG maintained high stability, forming shared, highly persistent amino acid contacts (>45% occupancy) with ASP603, ARG599, ASP541, and ARG387. In conclusion, we have demonstrated that EGCG and ECG achieve strong binding and allosteric disruption of the critical RNA-binding channel. We suggest that future structural optimization of these compounds into stable prodrug derivatives could yield promising antiviral therapies.
Graphical Abstract: Figure 1 (image not included).

11

Integration of lung tissue proteomics and genome-wide association data to identify lung cancer susceptibility proteins and potential drug targets

Xu, S.; Shi, J.; Shu, X.-O.; Tao, R.; Dou, Y.; Guo, X.; Wen, W.; Yang, Y.; Zhang, B.; Wu, J.; Deppen, S. A.; Li, B.; Zheng, W.; Long, J.; Cai, Q.

2026-06-22 epidemiology 10.64898/2026.06.18.26355973 medRxiv

Top 1%

0.6%

Background: Proteins directly impact disease development and act as drug targets. Therefore, we integrated genomic and lung tissue proteomics data to identify lung cancer susceptibility proteins, elucidating genetic mechanisms and candidate drug targets. Method: We profiled the proteome and genome in non-neoplastic lung tissue from 200 lung cancer patients. Using this data, we constructed genetic models to predict abundance across the proteome in lung tissue. We applied these models to genome-wide association study (GWAS) data from 55,174 lung cancer cases and 1,294,174 controls to evaluate their associations with the risk of lung cancer, overall and by major histological subtypes. Bayesian colocalization and Mendelian randomization (MR) analyses were used to prioritize putative causal proteins, which were cross-referenced with three main drug-protein databases to identify potential therapeutic targets. Results: We identified 29 proteins associated with lung cancer risk at a false discovery rate < 5%, including 25 for overall lung cancer, two (AQP3 and IL18) specifically for adenocarcinoma, and another two (HMGN2 and HLA-DMB) for squamous cell carcinoma. Of them, genes encoding 17 proteins reside at least 2Mb away from any known GWAS risk loci, including 14 for overall lung cancer (HYI, GPX1, GMPPB, DSP, HDDC2, MTCH2, SUOX, JMJD7, PDIA3, IL16, IQGAP1, SULT1A2, ARHGAP27, and TYMP) and three for subtypes (AQP3, IL18, and HMGN2). Among the 12 proteins located within the known risk loci, EPHX2, CLDN18, PSMD5, and CYP2S1 proteins showed an association independent of the proximal GWAS-identified lead variant. Colocalization and/or MR analysis suggested 11 potential causal proteins. Five of these candidate causal proteins (DSP, CLDN18, IQGAP1, IL18 and TYMP) are targeted by nine drugs already approved by the FDA or in phase III trials. 
Conclusion: Our study identified novel lung cancer susceptibility proteins and potential drug targets, offering valuable insights into lung cancer biology and future translational utilities.

12

Blood-based transcriptomic classification of lung cancer: a leakage-free nested cross-validation framework with LASSO

Bakim, S.; UrluOzalan, N.; Gulbahce Mutlu, E.; Demir, V.; Gulbahce, E.

2026-07-13 oncology 10.64898/2026.07.11.26357823 medRxiv

Top 1%

0.6%

Peripheral whole-blood gene expression profiling offers a minimally invasive route to lung cancer detection, but high-dimensional transcriptomic data are prone to optimistic bias when preprocessing and model selection are not properly separated from performance evaluation. We applied L1-penalised (LASSO) logistic regression to 303 peripheral whole-blood microarray profiles (123 lung cancer cases and 180 healthy controls; Gene Expression Omnibus accession GSE252168; Illumina HumanHT-12 v4) within a leakage-free nested cross-validation framework (5 outer and 3 inner folds), in which all data-dependent steps (imputation, univariate feature screening by ANOVA F-test with k = 500, and standardisation) were confined strictly to training partitions. Statistical significance was assessed by permutation testing (B = 100), and feature selection stability was quantified across outer folds. LASSO was compared with ridge logistic regression, linear support vector machines, and random forest under the same framework. The LASSO model identified a sparse 29-probe signature with a pooled out-of-fold area under the ROC curve (AUC) of 0.990 (nested estimate 0.989 +/- 0.015), accuracy 97.4%, sensitivity 94.3%, and specificity 99.4% at a 0.50 threshold; permutation testing confirmed significance (p = 0.0099). Six probes, including CDC42, U2AF1, and RPS15A, were selected in all five outer folds, forming a stable core, and all classifiers exceeded AUC 0.987, indicating a strong, algorithm-independent signal. A leakage-free nested cross-validation framework enables unbiased performance estimation and reproducible feature selection in blood-based lung cancer classification. The 29-probe panel is an internally validated candidate requiring prospective, multicentre external validation before clinical use.
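
The key idea of the framework described above is that inner (tuning) folds are drawn only from each outer training partition, and that every data-dependent step is fitted on training indices alone. The sketch below illustrates just that index bookkeeping with the standard library. It is not the authors' code, it does not fit a LASSO model, and it is not stratified by class; the function names, the seed handling, and the single made-up feature are all assumptions made for illustration.

```python
import random
import statistics

def k_fold_indices(indices, k, seed):
    """Shuffle indices and deal them into k disjoint folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def fit_standardizer(values):
    """Learn mean and standard deviation from training values only."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values) or 1.0  # guard against zero variance
    return mean, sd

def nested_cv_splits(n_samples, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) index lists.

    inner_splits are built from outer_train only, so the outer test
    fold never influences tuning or preprocessing.
    """
    outer_folds = k_fold_indices(range(n_samples), outer_k, seed)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [j for f, fold in enumerate(outer_folds) if f != i for j in fold]
        inner_folds = k_fold_indices(outer_train, inner_k, seed + 1 + i)
        inner_splits = []
        for v, inner_val in enumerate(inner_folds):
            inner_train = [j for f, fold in enumerate(inner_folds) if f != v for j in fold]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits

# Made-up single feature for 303 samples (the cohort size in the abstract).
x = [float(i % 7) for i in range(303)]
for outer_train, outer_test, inner_splits in nested_cv_splits(len(x)):
    mean, sd = fit_standardizer([x[j] for j in outer_train])  # train only
    test_scaled = [(x[j] - mean) / sd for j in outer_test]    # apply to test
```

The same pattern would apply to imputation and univariate feature screening: fit on the training indices of the current fold, then apply the fitted transformation to the held-out indices.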

13

Homology-aware cross-validation strategies for generalization assessment in RNA structure prediction

Bugnon, L.; Kulemeyer, G.; Gerard, M.; Di Persia, L.; Stegmayer, G.; Milone, D. H.

2026-06-29 bioinformatics 10.64898/2026.06.28.735057 medRxiv

Top 1%

0.6%

RNA secondary structure prediction is a fundamental challenge in bioinformatics, essential for understanding the functional roles of non-coding RNAs. Recently, deep learning models have transformed the field with impressive results, leading to critical discussions regarding the validity of current cross-validation strategies. On the one hand, traditional random partitioning yields overoptimistic results due to data leakage from uncontrolled homology. On the other hand, removing from the training set all sequences that exhibit even the slightest resemblance to the testing sequences penalizes learning-based methods by requiring generalization to completely out-of-distribution sequences. While it is very simple to remove sequences and retrain a machine-learned model, it is very difficult to remove the experimental data used for parameter tuning and the sequences used for the development of classical thermodynamic methods. Thus, these methods often benefit from implicit knowledge leakage. In this work we critically review existing cross-validation strategies for RNA secondary structure prediction: random splitting, clustering-based splitting, and leaving one RNA family out for testing. We analyze the advantages and limitations of each strategy and outline future directions to ensure fair comparisons across the full range of sequence similarities, with the same rigor for both classical and learning-based methods.
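
Clustering-based splitting, one of the strategies reviewed above, keeps all members of a homology cluster in the same fold so that near-identical sequences never straddle the train/test boundary. The sketch below assumes cluster labels are already available (for example from a sequence-clustering tool) and only shows the fold assignment. The greedy balancing rule and all identifiers are assumptions made for illustration, not the authors' procedure.

```python
def cluster_k_fold(cluster_of, k):
    """Assign whole clusters to k folds.

    cluster_of maps sequence id -> cluster id. Clusters are placed largest
    first, each onto the currently smallest fold (greedy balancing).
    """
    members = {}
    for seq_id, cluster_id in cluster_of.items():
        members.setdefault(cluster_id, []).append(seq_id)
    folds = [[] for _ in range(k)]
    ordered = sorted(members, key=lambda c: (-len(members[c]), str(c)))
    for cluster_id in ordered:
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(members[cluster_id])
    return folds

# Hypothetical sequence ids and family labels.
toy_clusters = {
    "seq1": "tRNA", "seq2": "tRNA",
    "seq3": "5S", "seq4": "5S",
    "seq5": "RNaseP", "seq6": "tmRNA",
}
toy_folds = cluster_k_fold(toy_clusters, k=3)
```

Random splitting corresponds to ignoring the cluster labels entirely, which is where the homology leakage discussed in the abstract comes from.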

14

Towards a Unified Exact Solution of Rearrangement Small Parsimony for Natural Genomes

Bohnenkaemper, L.; Frolova, D.

2026-06-28 bioinformatics 10.64898/2026.06.23.733974 medRxiv

Top 1%

0.5%

Phylogenetic reconstruction is a fundamental problem in comparative genomics. As a theoretical problem in rearrangement studies, this has been modelled as the Small Parsimony Problem (SPP), in which ancestral genome structures have to be determined so as to minimize the number of rearrangement events occurring throughout the phylogeny. This problem is of significant interest in microbial and cancer genomics, due to the prevalence and clinical importance of rearrangement events. Genome structures in this problem are expressed as sequences of markers, which are themselves oriented sequence features (such as genes) that abstract from non-structural variations. Recent research has focused on the problem under the natural genomes model, in which arbitrary variations in the copy number of markers are allowed. Natural genomes are often studied under the DCJ-indel model, a model which has already been successfully applied to plasmid data. There also exist ILP solutions to a variant of the Small Parsimony Problem under the DCJ-indel model. However, these solutions are limited in their applicability, as they make some critical simplifications for tractability purposes: ancestral marker frequencies and precomputed putative ancestral adjacencies, with their predicted likelihoods, are assumed as input. This creates multiple problems from both a theoretical and a practical perspective. Firstly, this simplification means that the full state space is not searched for a solution, but only the subset of genomes with the precomputed putative adjacencies, so an optimal solution to the exact SPP is not guaranteed. Secondly, marker frequencies are given externally, without any theoretical guarantees. Thirdly, the method used to precompute adjacencies relies on gene trees, which requires the use of genes as markers, even though gene annotation is often unreliable, especially in regions with a lot of rearrangement.
Additionally, this restricts the applicability of the approach to sets of genomes that are both divergent and large enough to be able to produce informative gene trees. This is, for example, rarely the case for plasmids, where nucleotide mutations are rarer than rearrangements and genomes are small. Hence, we revisit the problem to solve the exact SPP by introducing a cost to indel operations, which allows us to compute ranges of marker frequencies and derive theoretical results, that allow us to reduce the solution space that the ILP searches without sacrificing optimality. We show that this makes the problem tractable for the case of small and recently related genomes, first on simulated genomes, and then on a set of pathogenic plasmids which represent a realistic use case for the method.

15

Selenium-enriched rapeseed extract synergizes with chemotherapy drug cisplatin in inhibiting proliferation and promoting apoptosis of colorectal cancer cells

Duan, X.; Lu, Y.; Zhou, H.; Zhang, Z.; Zhou, Z.; Wang, M.; Dun, X.; Chen, Z.; Zhu, Y.; Wang, H.; Jiang, L.

2026-07-10 cancer biology 10.64898/2026.07.06.736755 medRxiv

Top 2%

0.5%

Chemotherapy of colorectal cancer (CRC) with cisplatin (CDDP) is hampered by drug resistance in cancer cells and cytotoxicity to normal cells, highlighting the urgent need for combination therapeutic strategies. Selenium-enriched rapeseed extracts exhibit anti-cancer effects, but the bioactive components and mechanisms remain unclear. Here, we applied different solvents to fractionate extracts from selenium-enriched rapeseed and found that the water extract (WE) fraction significantly enhanced the cytotoxic effect of CDDP on cancer cells without damaging normal cells. HPLC-ICP-MS analysis revealed that methylselenocysteine (MSC) and selenocystine (SeCys2) were the main selenium species in WE. Through cell biology and integrative multi-omics analysis, we found a synergistic anti-CRC cell effect when combining CDDP with MSC, sulforaphane (SFN), celastrol (Cel), indole-3-carbinol (I3C), α-linolenic acid (ALA), or linoleic acid (LA). We propose that the CDDP-WE combination treatment holds promise for improving curative efficacy in chemo-refractory CRC patients in the future.

16

A foundation model enables prediction of natural product molecular properties, bioactivity, and structural similarity from biosynthetic gene cluster sequence

Walker, A.

2026-07-07 bioinformatics 10.64898/2026.07.05.736569 medRxiv

Top 2%

0.4%

Genome mining is a powerful technique in natural product discovery, where biosynthetic gene clusters that are likely to produce novel or desirable natural products are identified through bioinformatic analysis. There are many more predicted biosynthetic gene clusters than can easily be experimentally characterized. Additional computational methods to prioritize biosynthetic gene clusters by the bioactivity, structural properties, or novelty of the product would make genome mining more efficient. Multiple machine learning/artificial intelligence models have been developed to predict product properties from biosynthetic gene cluster sequence, but they are limited by small quantities of training data. Model pretraining with unlabeled data is a powerful technique to develop models that can learn on a limited amount of labeled training data. Biosynthetic gene clusters are well suited to this strategy because there are many predicted clusters with only a small percentage being characterized. This paper reports BGC-MLM, a foundation model that is pretrained with a masked language task on predicted biosynthetic gene clusters and then fine-tuned for downstream applications including prediction of product structural class, bioactivity, chemical properties, counts of functional groups, and chemical fingerprint. Comparison to a model trained without pretraining shows that pretraining generally improves performance. BGC-MLM shows better or similar performance to existing specialized methods for these tasks, demonstrating its utility as a foundation model for natural product genome mining.

17

BioMetAll v2.0: Introducing Scores, Metal Discrimination, and Side-Chain Descriptors for Predicting Metal-Binding Sites in Proteins.

Marechal, J. D.; Fernandez Diaz, R.; Pena Losada, R.; Sanchez Aparicio, J. E.; Gao, W.; Alemany, M.

2026-07-12 bioinformatics 10.64898/2026.07.09.737562 medRxiv

Top 2%

0.4%

Predicting the location of metal-binding sites in proteins is crucial for fundamental biological questions and biotechnological applications. Over the past decade, the rise in metal-bound protein structures in the Protein Data Bank, combined with advanced statistical models such as deep learning, has accelerated the development of metal-binding site prediction tools. Several approaches are now available, offering high-quality benchmarks and predictive performance. Our initial development in this area is BioMetAll, whose first version was based on backbone pre-organization. Here, we introduce its second version, featuring two major updates: 1) metal-specific scoring functions and 2) prediction using backbone geometry alone or in combination with first coordination sphere descriptors. Apart from demonstrating metal sensitivity and yielding better benchmarking results, this new version allows the assessment of the influence of considering the metal's first coordination sphere versus backbone pre-organization on how metallic species bind to proteins.

18

EnzyKAN: Protein Language Model Embeddings and Kolmogorov-Arnold Network Variants for Enzyme Commission Classification with a Proposed Electron-Transfer Physics Feature Framework

R, S.; Reddy, B. R. R.

2026-06-29 bioinformatics 10.64898/2026.06.23.734004 medRxiv

Top 2%

0.4%

Motivation: Computational enzyme classification has previously utilised sequence homology features and protein language model embeddings. The Kolmogorov-Arnold Network (KAN) paradigm, which uses learnable edge functions rather than fixed ones, has shown promising results in biological sequence tasks. Results: We present a fully reproducible investigation of KAN variants for seven-class EC classification on up to 9,516 labelled sequences from the CLEAN benchmark [1] (9,386 for language model experiments). In the sequence-only setting, fixed-basis KAN variants moderately outperformed an MLP baseline (macro F1 = 0.17-0.29). Using ESM-2 650M embeddings [2] greatly improved results under 5-fold cross-validation: MLP macro F1 = 0.750 ± 0.009, accuracy = 0.823 ± 0.009; learnable SineKAN macro F1 = 0.716 ± 0.023, accuracy = 0.788 ± 0.019. MLP performed comparably but did not exceed conventional baselines. As an aside, we introduce but do not investigate an approach to EC oxidoreductase sub-classification through the use of a Marcus theory-based electron transfer feature framework. Availability: Code and result files are available at https://github.com/sanjuz-cas/ENZYKAN.
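
The results above report macro F1 alongside accuracy, and the two can diverge when classes are imbalanced. The sketch below computes macro F1 from its standard definition (the unweighted mean of per-class F1 scores) on a small made-up label set; the labels and the `macro_f1` helper are illustrative assumptions and are not drawn from the EnzyKAN code or data.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores.

    Per-class F1 is 2*TP / (2*TP + FP + FN); a class with no true or
    predicted members contributes 0.0.
    """
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels: class 1 dominates and class 2 is never predicted.
y_true = [1, 1, 1, 1, 2, 3]
y_pred = [1, 1, 1, 1, 1, 3]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, classes=[1, 2, 3])
print(f"accuracy = {accuracy:.3f}, macro F1 = {score:.3f}")
```

In this toy case a single missed minority class pulls macro F1 well below accuracy, which is one reason a macro-averaged score is informative for a multi-class task such as EC classification.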

19

Protein hydration and druggability

Panasenko, S.; Khorev, V.; Petukhov, M.

2026-07-08 biophysics 10.64898/2026.07.06.736750 medRxiv

Top 2%

0.4%

A priori assessment of target proteins' druggability remains an unsolved problem in the field of drug development. The empirical approaches widely used to solve this problem demonstrate low efficiency. In this work, we investigated the factor of hydration of a representative set of 65 evolutionarily and structurally unrelated human enzymes in a water environment. This factor depends only on the structure of the proteins, and not on the physical and chemical properties of any potential ligands. The results show that, unlike the widely used approaches based on calculations of the accessible surface area (ASA), the content of low-entropy water molecules (LEW) in the active sites of human enzymes is systematically higher than that in other areas of their surface, including inactive cavities. Optimal criteria and a step-by-step procedure for identifying protein ligand binding sites are proposed. The proposed approach, based on the calculation of the LEW content in the first hydration layer of potentially interesting target proteins, makes it possible to evaluate their medicinal suitability even before the development of any ligands. The article also presents the results of a comparative analysis of experimental Raman spectroscopy data and the results of molecular dynamics simulations of water hydrogen bonds using three widely used water models (TIP3P, OPC3, and TIP5P) and standard algorithms for calculating hydrogen bond networks.

20

AptBacterialDB: A Comprehensive, Manually Curated Database of Antibacterial Aptamers

Bajiya, N.; Gupta, I.; Raghava, G. P. S.

2026-07-03 bioinformatics 10.64898/2026.07.01.735956 medRxiv

Top 2%

0.4%

In recent years, aptamers have transitioned from mere laboratory tools to highly potent molecular recognition agents capable of overcoming the strict limitations of conventional antibiotic therapies. We have developed AptBacterialDB, a large, comprehensive, manually curated database of experimentally validated antibacterial aptamers spanning 1996 to 2026. The database contains a total of 2131 aptamers targeting approximately 75 different bacterial classes, and 124 aptamer targets with 95 entries found in UTexas databases, 97 in AptaDB, and 28 in Aptabase. It contains 1555 unique aptamer sequences, 189 unique modifications, 40 different selection approaches, and 44 different affinity methods. It integrates detailed annotations across about 20 fields, including sequence information, nucleic acid type, binding affinity, modifications, and experimental and functional details. The secondary structure of the aptamers was predicted using ViennaRNA Package 2.0, demonstrating that they adopt mostly stable conformations, with a structured stem region. MySQL was used for database development, and a knowledge graph was integrated using ArcadeDB/openCypher for graphical visualization of aptamer-target-organisation relationships. Facilities such as different search modes, browsing, similarity search, REST API access, and entries linked to the existing database for a broader view of the aptamers have been provided. AptBacterialDB (https://webs.iiitd.edu.in/raghava/aptbacterialdb/) provides a user-friendly centralized platform to accelerate antibacterial aptamer research, therapeutic development, biosensor design, and computational modelling efforts.